Use hash code instead of full source text for language service caches #5944
- let AreSameForParsing((fileName1: string, source1: string, options1), (fileName2, source2, options2)) =
-     fileName1 = fileName2 && options1 = options2 && source1 = source2
+ let AreSameForParsing((fileName1: string, source1Hash: int, options1), (fileName2, source2Hash, options2)) =
+     fileName1 = fileName2 && options1 = options2 && source1Hash = source2Hash
It depends how accurate the string hash codes are and how much we care about very odd behaviour on collisions.
Checking only hash codes means that when there is collision we will re-use results from some other random parse or typecheck.
So if collisions happen very rarely this may just be ok.
If hash codes are ever based on the partial hash of the string (e.g. a prefix) then this shouldn't be done.
We should definitely have a think about this. At least initially, I feel that if the file names are the same and the hashes are the same, it's highly unlikely we'd get some arbitrary cached value. But let's definitely make sure we feel confident on that :)
I don't really trust string.GetHashCode; I doubt it was designed with strong collision resistance in mind. If a true cryptographic hash like SHA is too expensive, can you maybe at least use a specialised one like MurmurHash? (Sorry, can't sleep)
I think what's worth thinking about is whether this is a scenario where strong collision resistance is important. For example, in each cache here, the key varies:
And so on. All current keys are an aggregate of some information. Given that, does it make sense to use a cryptographically strong hash function? Is the Birthday Problem applicable for IDE tooling? Would doing so be meaningful perf-wise? Not entirely sure how to answer each question, but we definitely do want to move away from storing the whole source text if we don't actually need it.
I think I often sound too negative. Hashes are awesome and this PR is great. String.GetHashCode is 32 bits, so about 4 billion possible values. Really large source files can be many kB to maybe even MB if they are auto-generated? I was never that good in statistics, and am just talking out of my behind because I'm bored. But that's a few orders of magnitude too few for me to be comfortable ;D But I would reverse that question: is there any reason NOT to use a strong hash? Typechecking is probably so expensive that running SHA-256 over the source wouldn't be noticeable. And if it is, there is still murmur3, which is a fast, specialised, 128-bit hash.
So I'm pretty sure my stats need some work (it's been a while!), but here's a quick calculation in the awesome Fable repl. So for 10k files, we could see a ~1.1% chance of a collision with string's GetHashCode, assuming we only use that as a key for a cache. Given that we aggregate additional information I think it's fine, but I wouldn't take that as gospel 🙂 That said, I certainly don't see anything wrong with a stronger hash function if the perf isn't a problem. We'd have to measure that, but if it's negligible then I think something like MD5 is fine. I'd much rather use something in the BCL than have to hand-code some F# implementation of something in here.
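For reference, the estimate above can be reproduced with the standard birthday-problem approximation. This is only a sketch: it assumes the 32-bit hash values are uniformly distributed and that the hash alone is the cache key, neither of which is guaranteed here. The symbol n (number of distinct source texts hashed) is introduced for illustration.

```latex
% Probability that at least two of n texts share a 32-bit hash,
% assuming uniformly distributed hash values (an assumption).
p(n) \;=\; 1 - \prod_{k=1}^{n-1}\left(1 - \frac{k}{2^{32}}\right)
     \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2^{33}}\right)

% For n = 10^4:
\frac{n(n-1)}{2^{33}} = \frac{99\,990\,000}{8\,589\,934\,592} \approx 0.01164
\quad\Longrightarrow\quad
p(10^4) \approx 1 - e^{-0.01164} \approx 0.0116
```

That is a bit over 1%, in line with the figure quoted above. Because the cache keys also include the file name and options, the chance of a collision that actually returns a wrong cached result should be lower than this bound.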
So, second commit is uhhh...a lot more. But this addresses all sources of source files being used as keys in the caches. Let me make sure this PR is marked as WIP so we don't ever accidentally pull it before really vetting the approach.
The .NET MD5 and SHA-1 implementations have approximately the same performance, while SHA-256/384/512 are way slower, so I'd use SHA-1, or murmur3 if it's fast.
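To make the BCL suggestion concrete, here is a minimal sketch of what a SHA-1-based key could look like. It is not what this PR or #6001 implements: `hashSource` and `areSameForParsing` are hypothetical names, and encoding the text as UTF-8 and the digest as Base64 are assumptions made only for illustration. Performance would still need to be measured.

```fsharp
open System
open System.Security.Cryptography
open System.Text

/// Hypothetical helper: SHA-1 digest of the source text, Base64-encoded
/// so that cache keys can be compared with ordinary structural equality.
let hashSource (source: string) : string =
    use sha1 = SHA1.Create()
    let bytes: byte[] = Encoding.UTF8.GetBytes(source)
    let digest: byte[] = sha1.ComputeHash(bytes)
    Convert.ToBase64String(digest)

/// Same shape as AreSameForParsing in the diff, except that the key carries
/// a 160-bit digest rather than the full source text or a 32-bit hash code.
let areSameForParsing ((fileName1: string, sourceHash1: string, options1), (fileName2, sourceHash2, options2)) =
    fileName1 = fileName2 && options1 = options2 && sourceHash1 = sourceHash2
```

With this shape the source text is only read once to compute the digest, and the cache holds a 28-character string per entry instead of the whole file.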
@cartermp Is the second commit necessary? What are we saving by hashing source file names?
Ah, yeah they're just the path+names (even though they're not named that way). I'll get rid of the second commit.
Closing this as the PR, #6001, makes source text equality abstract for caching. If you are using the string implementation of
A path forward for caching will be interesting. Ideally we might want to use an
Not sure if this is the correct overall approach - but hopefully this can help encourage a better fix 🙂
The following issues track large source files ultimately living on the large object heap, forcing horrible GC delays over time:
#5937
#5936
#5935
#4881
But as mentioned in #4881 (comment), the language service uses the full source text as a key for caches, making it more difficult to simply not pass around the full source text. This seeks to rectify that. It may also help the linked issues.